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Abstract 

Using classical simulated annealing to maximise a function tp defined 
on a subset of R'', the probability ¥{ip{dn) < i/'max — s) tends to zero 
at a logarithmic rate as n increases; here 9n is the state in the n-th 
stage of the simulated annealing algorithm and 'i/'max is the maximal value 
of i/). We propose a modified scheme for which this probability is of 
order n~^^^ logn, and hence vanishes at an algebraic rate. To obtain this 
faster rate, the exponentially decaying acceptance probability of classical 
simulated annealing is replaced by a more heavy-tailed function, and the 
system is cooled faster. We also show how the algorithm may be applied 
to functions that cannot be computed exactly but only approximated, and 
give an example of maximising the log-likelihood function for a state-space 
model. 

Keywords : Central limit and other weak theorems, Computational methods 
in Markov chains, Sequential estimation, Markov processes with continuous pa- 
rameter, Monte Carlo methods, Stochastic programming. 

MSC : 60F05, 60J22, 60J25, 62L12, 65C05, 82C80, 90C15. 

1 Introduction 

Simulated annealing is a simulation-based approach to the problem of optimising 
a function. In the present paper we will be concerned with a real- valued function, 
ip say, defined on a subset O of M'', and our aim is to maximise tp- Thus 
we assume that "0 is bounded and that its supremum is attained at least at 
one point. Simulated annealing is designed to find the global maximum of ip, 
even if ijj has local maxima . It has been e xten sively studied, see for instance 
iDel Moral and Miclol lUflOfll). ICatonil lll999l) andlCot and Catonil l|l998l) among 
many others, and lBartoli and Del Moral! I)200lj) for an elementary introduction 
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to the subject. The classical simulated annealing algorithm departs from a 
Markov transition kernel, which we denote by K{-,-), on O, and a positive 
sequence {f3n)n>o increasing to infinity. The sequence is often referred to as 
an (inverse) cooling schedule, because is often interpreted as a temperature; 
this terminology originates from statistical physics. Then, starting from an 
initial point Oq £ Q, a, sequence {9n)n>o is constructed recursively as follows. 

(al) In stage n, given the current state sample a new proposed position Z 
from K{0n,-). 

(a2) Set 9n+i = Z with probability 

and 9n+i = On otherwise. 

Here (•)+ is the positive part. We notice that if ip{Z) > ip{dn), then the proposed 
new state Z is accepted with probability one. A proposal Z at which ip is smaller 
than at the current 0„ may be accepted, but this becomes increasingly unlikely 
for large n since /?„ — s- cx). 

The basic idea of simulated annealing is as follows. The update rule above 
corresponds to a Markov transition kernel, Kp say, on Q; cf. (|2.6|l below. Under 
additional assumptions including that K is positive recurrent and reversible 
with respect to its stationary distribution, 7 say, 

j{dx) K{x, dy) = j{dy) K{y, dx), 

one can prove that for fixed /3, the stationary distribution of Kp is absolutely 
continuous with respect to 7 with Radon-Nikod ym derivative proportional to 
exp{Pip{x)} (cf. ICatoniillQQa Proposition 1.2, or lBartoH and Del Moral l20niL 
p. 64). This indicates that as /3 increases, this stationary distribution becomes 
increasingly concentrated around the maxima of ip. Now, in the beginning of 
the simulation scheme Pn is small (the temperature is high), and the particle 
9n is allowed to explore the space 8 rather freely. When the temperature cools 
down (/3„ gets large), the particle is more and more lured to the regions where 
ip is large and should in the limit end up at a maximum point of tp. 

Obviously, the kernel K and the sequence (/3„) are important design pa- 
rameters of the algorithm. A typical choice for (/3„) is a logarithmic increase; 
Pn = f3o^og{n + e) for some /3o > 0. We note that with this cooling schedule, 
the acceptance probability in (a2) above becomes 

(n + e)-*(^('^")-^(^))+. (1.1) 

Under additional regularity assumptions one can prove that for (3o small 
enough and if has a single global maximum, it holds that for all e > 0, 

F{ij{9n) < V'max - e) ^ as n ^ c», (1.2) 

where 'i/'max = supj.gQ '(/'(x). How fast is this convergence? In many works on 
simulated annealing the space 8 is assumed finite, and one may then let e 
and thus study P('0(0„) < V'max)- Typically this probability tends to zero at 
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an algebraic rate, see for instance iGielis and MaesI l|l999l Eq. (22)) (take / as 
the indicator function of non-optimal states) and references in this paper. For a 
continuous Q the situation is different. If O C M'' one can show (see Appendix 
that the rate of convergence in H1.2|l is only logarithmic. Alternatively, one can 
prove that there arc numbers C and C" such that for any e > 0, 

P(V'(^«) < V^max - e) < Cn-^'^ (1.3) 

Thus, the algebraic rate becomes infinitely slow as e ^ 0. iLocatellil l)200l|) pro- 
posed a refinement of the annealing scheme that reaches non- vanishing algebraic 
rates, but it requires knowledge of ■0max which is an assumption we do not want 
to make. 

In the present paper we propose a modified simulated annealing scheme such 
that for any e > there is a number such that 

P(^(^„)<^max-e) <an-i/3(l+logn). (1.4) 

We will then say that the rate of convergence is 1/3, up to a logarithmic term. 



2 Description of the new simulated annealing 
scheme 

Just as in classical simulated annealing, the proposed scheme departs from a 
Markov transition kernel K and a cooling schedule (/?«). The difference lies in 
that the exponential function of the classical algorithm's update is replaced by 
a different function, and that the cooling schedule is altered. More precisely, we 
let g : M+ R+ be a C°°-function such that g(0) = 1, g is non-decreasing and 
g{t) —> cxD as t — !■ oo. We set f = 1/g and suppose that / is convex and such 
that supj>o \tf'{t)\ < oo. Then the algorithm looks as follows. 

(bl) In stage n, given the current state 0„, sample a new proposed position Z 
from if (0„,-). 

(b2) Set 9n+i — Z with probability 

and On+i = 9n otherwise. 

In classical simulated annealing g{t) = e*. In Section Owe advocate the par- 
ticular choice g{t) = 1-1- t/r for some r > 0, and thus f{t) ^ r/t as t ^ oo. 
Compared to f{t) ~ exp(— <), this allows the algorithm to be 'more bold' in 
exploring regions far away from the current state. On the other hand we will 
let (in be of order n" with a = 1/3, so that this sequence increases much faster 
than logarithmically. Together, these conditions imply H1.4|) . We also remark 
that with g as above and /3„ = n^/^, the acceptance probability in (b2) becomes 
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which should be compared to 1)1. we see that (|2.5|) decays much slower as 
ip{9n) — '4'{Z) — *■ oo, and thus again that the new algorithm is less likely to 
reject proposals with function values far below the current one. 

Modifications of the acceptance function f{t) ~ exp(— t) of classical simu- 
lated annealing to speed up convergence rates have been discussed extensively 
in the statistical physics literature, and is there often referred to as 'fast sim- 
ulated annealing'. The acceptance function f{t) — 1/(1 + t/T) introduced 
above is similar to funct ions used in such pape rs; for instance, it corresponds 
to A = 1 in Eg. (28) of iGiehs and MaesI l|l999l) . and to = 2 in Eq. (5) of 
iTsallis and Stariold l|l996() . None o f these authors obtained ra te of convergence 
results for these schemes however. iTsallis and Stariold l|l996l Example 3) did 
obtain a convergence rate for f{t) = 1/(1 -I- i)^ and showed that this rate is 
indeed faster than for classical simulated annealing; the result however assumes 
that i/imax is known and these authors worked exclusively on a finite set O. 

We now return to the algorithm and define, for any /3 > and x,y £ Q, 

ap{x,y)^ f{P{iAx)^i:{y))+). 

One step of the above algorithm is then described by a Markov transition kernel 
Kj3 defined as 

Kji{x,dy) = ap{x,y)K{x,dy) + (^^ ^ apix, z) K(x,dz)^ ^^{dy). (2.6) 

Thus, assuming that the initial point is random and drawn from some prob- 
ability distribution 770 on 0, the sequence {On)n>o is an inhomogeneous Markov 
chain with initial law 770 and transition kernels {Kf}^)n>o] more precisely, for 
any n, ^n+i has conditional distribution Kp^iOn, ■). 

We will suppose that 8 is equipped with its Borel cr-field S(8), and we 
will also assume that the Markov transition kernel K satisfies the following 
condition. 

Hypothesis 1. There exists Sk > and a probability measure A on (0,S(O)) 
such that 

y{x,A) e e X B{Q) : £kA(A) < K{x,A) < —X{A). 

Of course, Hypothesis is easier to fulfil if O is compact or bounded. 

Regarding the function -0, we also make some assumptions. Put, for any 
£ > and a < b, 

Jje,a,b ^ {a; g e . ^^^^ -b-e< ■4}{x) < V'max - a - s} . 

Hypothesis 2. The oscillations of ijj are bounded, that is, 
osc(V') ■— sup \'ip{x) — ip{y)\ < 00. 

x,y£Q 

Hypothesis 3. Either one of the following two assumptions holds true. 
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(i) For all e > small enough there are numbers Co(s) > and e' > such 
that for all 6 > 0, 

< or iS>e'. 

(a) The function ip has a single global maximum, ^max say, located in the 
interior of O (which is thus non-empty). The probability measure A is 
absolutely continuous with respect to Lebesgue measure and its density is 
locally bounded. The function ip is in {x : ipi^) > ipmax — which 
is a neighbourhood of ^max (for some e" > Q), and the quadratic form 
'0"(^max) is negative definite. 

The attentive reader will notice that one could replace the assumption of a 
unique maximum by an assumption that there are a finite number of maxima, 
and that one could replace (ii) above by some more sophisticated assumptions 
on the derivatives of ■0. This requires a higher level of technicality but the whole 
proof would contain the same ideas and this is why we write the assumptions 
in this way. 

3 Rate of convergence 

Throughout the remainder of the paper we take /3„ = V 1 for some < a < 1. 
The choice of this particular sequence will be explained in Remark 13.81 We 
denote by (0n)n>o the sequence produced by the annealing scheme for this 
cooling schedule. The main result of the present section is the following. 

Theorem 3.1. Let M = o&c{ili) V osc(V')" and suppose that g{t")/t as 
t — !■ oo. Then for all e > small enough (if Hypothesis\^i) holds) or < e < e" 
(if HuvothesisTB/ii) holds), there exists a > depending on e such that for 
all n, 

P(^(0„) < T^^ax - e) 

under Hupothesis \^i)- or 

PWOn) < 0max " s) 

under Hupothesis \^ ii). 

Corollary 3.2. Choosing a — 1/3 and g(t) = l+t/r, where t > is arbitrary, 
the bounds of Theorem \S. 1\ are Ci^Cn^^/^{l + logrt). 

Remark 3.3. // we want to have terms of the same order in the bounds of 
Theorem \3.1V we see that g{Mn°')'^ /n and f{e'n°') (or f {{s" ~ e)n°') , depending 
on the case) should be of the same order. Thus f(t) should be of order t^^^^ as 
t — > oo. With this choice all terms in the bound have the same order, and so 
there is something optimal to it. With our inequalities, it does not seem possible 
to have a better rate. 
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Remark 3.4. In Corollarv \8.'^^ there is a parameter t > which can be chosen 
arbitrarily. This parameter plays the role of a temperature like in classical sim- 
ulated annealing and can be tuned by the user to optimise convergence. On the 
contrary to classical simulated annealing there is, theoretically, no restriction on 

T. 

Before going into the proof of these resuhs, we wiU proceed through some 
technical lemmas. First we however give some additional notation. The total 
variation distance z^||tv between two probability measures /i and v is defined 
as sup^ Ia'(^) ~ ^(^)|7 where the supremum is taken over the cr-field on which 
the measures are defined. The set of probability measures on (8,,B(8)) will be 
denoted by 'P(e). 

Lemma 3.5. For any (3 > it holds that 

y{x,A) e e X S(e) : Kp{x,A) > £k/(/3osc(i/;))A(^). 

Coro llar y 3.6. The preceding lemma an d Dobrushin's theorem fsee lDobrushiri. 

or Del Moral and Guionnei. \200A } imply that for any /? > and any 
probability measures (i and v on Q, 

Wt-iKp - i^KpWtv < (1 - £k/(/3osc(-0)))|1^ - i^\\tv- 

Proof of Lemma\^ Take /3 > and {x, A) e Q x 6(6). Then 

Kfi{x,A) > / afj{x,y)K{x,dy) 

J A 

> f f{Posc(i,))eK\{dy) 

J A 

= eKf{Posc{^))\{A). 

□ 

The above corollary implies that for any /x, the sequence {fJ,KJ^)n>o is a 
Cauchy s equence in tota l variation norm. Thus there exists a total variation 
limit (cf. iLindvallL I2002L p. 232), which we denote by fj,f3. This probability 
measure is invariant for Kfs, and it does not depend on the particular choice of 
the initial distribution fi. It is hence the unique invariant distribution of Kf^. 

The convergence of simulated annealing hinges on the fact that the law of 
9n, which we denote by r]n, is close to fJ-p„, and that for large /?„ the measure 
fip^ is concentrated on the regions where ip is large. This concentration is the 
subject of the next lemma. We set 

^{xeS: ij{x) > V'max - e}, 

and C/'^''^ is its complement in O. 

Lemma 3.7. For all {3 > and e > small enough ( if Hypothesis [3^^^ ) holds) 
or < e < e" (if Hypothesis\^ii) holds), there is a constant depending on e 
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such that 

^ / ^l+/3osc(i/>) \ 

lipiU^'") < y ( 1 + J f{t)dtj +/(/?£') under Hypothesis\^i), 

lipiU^'") < f{t)dt + f{(3{e" ~ £)) under Hypothesis\B(ii). 

P Jo 

Proof. Fix /? > and s in the appropriate range. We have 

^ip{dx) Kp{x,dy) 

Hp{dx) Kp{x,dy). 



For X £ U^''^ and y G t/"^, Kfj{x,dy) — K{x,dy). Thus the first integral above 
can be bounded as 

fipidx) Kf3{x,dy) = / fif:s{dx)il~ K{x,dy)] 

< 



Hf3{dx) 1 - / EKKdy) 



Similarly, for the second integral, 

Hp{dx) Kp{x,dy) 



< 



xeu^,yeu^ 

fi,3{dx)afj{x,y) K{x,dy) 

xeu^ jyeu^-" 

fj.f3{dx)af3{x,y)— X{dy) 
< — I /(/3(^max - e - yj{y))) X{dy), 

SO that 

MU^'l < . / /(/^(V'max - e - i,{y))) Xidy). 

s^r^l^ ) Jyeu^'" 

To finish the proof we will now bound the above integral as in the statement 
of the lemma. If Hypothesis Oli) holds, take S = 1/(3 and proceed as 

osc(i/') /5 

fmm.. - e - i^iy))) X{dy) < ^ f{t)XiU'''^^'^^+'^) 

osc (-;/;) / 5 

i=0 i>6'/S 
l+osc(V0/5 



< Co(e)<5 ( /(O) + / f(t)dt]+f[j 







If HypothesisETii) holds we employ Morse's lemma (see e.g. lBereer and GostiauxL 
Il988l Theorem 4.2.12) to make a change of variables in {x : il^{x) > V'max — £"} 
such that with some bounded function ^ (that only depends on 



/(/3(V'max-£-^(2/)))A(dy) 

< / /(/3(Vmax-£-^(2/)))A(dy) 
+ / f{(3{e" -e))X{dy) 

Jj/;(y)<j/)max-e" 

< / fm^-e))mdt + fw{s" -e)) 

<./3osc(i/j)^ 

-iMl nn)du + f[(5{e" -e)), 

after a change of variable u — j3{t^ ~ e) . □ 

Remark 3.8. In the following we will show that ?]„ is close to pLp^. Using 
Lemma \3. 7\ to bound Ai/3„ (C^^'*^), we obtain a bound larger than l//3„. We would 
like to compare rin{U^''^) to a power of n, so it is natural at this point to take, 
for some a > 0, 

/?„ = V 1. 

For technical reasons appearing in the proof of Theorem we need to take 
a < 1. 

The law 77„ approaches which becomes increasingly concentrated on 
regions where tp is large, but at the same time /3„ is changing. The following 
lemma serves us to bound the distance between /i^ and /i/j'. 

Lemma 3.9. With C = (l/e^) supt>o \tf'{t)\ it holds that for any fi' > (3 >Q, 
Proof. We have, using Corollary 13. 61 

ll/U/3 - /-i/3'||TV < Wl^pKp - HP'KpWty + W^lfi'Kp - ^ip'Kp^WTY 

< (1 - £_ft:/(/30Sc(V')))||/i/3 - Af/3'||TV + sup WllKp- ^iKfi^W^Y- 

Pick /i e ^(0)- We may construct two coupled samples from ^Kp and ^Kpr 
respectively by first sampling x from /i, then sampling y from K{x, •), sampling 
U from the uniform distribution on (0,1) and finally accepting the proposal y 
ii U < ap{x,y) or U < ap>{x,y) respectively. Similarly to Appendix IbI we may 



- M/5'IItv < Cg{f3osc{ip)) ( -T - 1 



8 



then conclude that 

WfiKp - fiK/s^W^Y < sup \ap{x,y) - ai3,{x,y)\ 

x,y£Q 

< sup \fi(3u)-fiP'u)\ 

0<'U<osc('i/') 

< sup i\fi^u)\(3u)(^-l 

0<Ji<osc(i/') \ P 

where ^ = ^(u) is a point between /? and /?'. Since / is assumed convex and 
non-increasing, and hence |/'| non-increasing, it holds that \f'{S,u)\ < 
We thus arrive at the bound 

eK/(posc(?/>)) o<„<og(,(^) V/ii y 

Since is assumed bounded, the proof is complete. □ 

Proof of Theorem \8.1\ Set A„ = ||'7n ^ M/3rellTV- If we can prove the inequality 

A„ < C^SM^^^±^. (3.7) 
n 

the result will follow from Lemma 1X71 and . in case of Hypothesis|2Ii), the bound 
<?>!• 

In order to prove 1)3. 7|l the assumption g{x°')/x — > as a; — s- oo will be 
instrumental. We start by deriving a recursive bound for A„. By Corollary 
and Lemma 13.91 we have, for all n. 



< (1 - £A-/(/3„ osc(7/;)))h„ - /i^„ IItv + gin'' osc(V.)) 



A„+i < WVnKp^ - /i/3„if/3„||TV + ||M/3„ " Ai/3„+il|TV 

P. 

<(l-e^/(/3„oscW))A„-^C ^(""°;y» . 

n + 1 

Iterating this recursion yields 

A..i<E n (l-s./(AoscW))xC^(^^^ 

n 

+ JJi^- £KfiPkOSc{i>))) X Ijryi -^ftHxv, 
fe=i 

where an empty product (when g = ti) is interpreted as unity. 
Define F such that F'{x) = /(x"). Then for 1 < g < n - 1, 

71 11 

log W (l-£A'/(/3fcOsc(^)))<- E eK/(osc(^)fc") 

k=q+l k=q+l 

rn+l 

< —£k / /(osc(7/;)a;") c?x 

= ^(^^(osc(V')(n + 1)) - ^^(osc(^)(q -f 1))). 
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For q ^ n this is an equality. Putting Ci ~ SkI osc('0) and Ci = osc(?/') we 
thus obtain 



dx 



Denote the integral on the right-hand side by /„+i. First we notice that 
since g > 1, In+i — > oo as n — s- cx). Next we rewrite this integral as 



In+l — 



ri+1 



C1C2 a; 



where CiC2f{C2{x+l)°') is the derivate of the exponent. By partial integration, 
integrating the first factor of the integrand above and dropping all negative 
contributions (recall that g and g' are non- negative), we obtain the bound 



In+l < 



„C.FfC,fa+l)) g(g2"(^+l)").9(g2X°) 



C1C2 



n+1 



n+1 



,c,F(c.(.+i)) 5M(^±in g(t^ 

C1C2 ^2 



(3.8) 



Denote the integral on the right-hand side of (|3.8|l by I'n^i- This integral is 
similar to In+i, the difference being that the integrand is multiplied by a con- 
stant times 5'(C^(a;-|- l)")/x. Since this ratio tends to zero as x — > 00, and since 
In+l — > 00 (as noted above), it holds that for any < k < 1, /'j+i — ^^n+i for 
sufficiently large n. Hence 



^n+l 



< 



1 



1 - K 



.c.F(c.(.+i)) g(g2"(^ + i)")g(g2x") 

C1C2 X 



n+1 



< C'e<^i^^(C2(n-H2)) 



g{M{n + 2)°'f 
n+1 



for sufRciently large n; recall that M — C2 VCf'. Summing up thus far, we have 
shown that 



A„+i <C[e 



< C 



Ci(F(C2(n+2))-F{C2(n+l))) 9{M{n + 2)"f ^ ^-Ci F(C2 («+l)) 

n+l 

g{M{n + 2Yf 



^ g{M{n + 2rY ^ ^_c.F(c.(n+i))^ ^ 



(3.9) 



where the second inequality follows as F' is bounded. 

Now take an arbitrary m > 0. Since g{x°')/x — > there is an Xm > such 
that g{x°')/x < 1/m for x > x,n, or, equivalently, f{x") > m/x for x > Xm- 
Integrating this inequality yields F{x) — F{xm) > m log(a;/a;m), so that 

— mCi 

g-CiF(x)+CiF(x„) < ^ ^ " 
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for X > Xm- Picking m such that mCi = 1 we see that as n cx3, the second 
term on the right-hand side of (|3.9() is of smaUcr order than the first one. We 
conclude that 

- ^ — — ' 

which is iflTTjl . □ 



4 Simulated annealing on a function that cannot 
be computed exactly 

In this section we assume that the function to be maximised cannot be com- 
puted exphcitly, but that we have available an approximation to it. This approx- 
imation, denoted by -0^, can be stochastic, based on Monte Carlo procedures; 
the next section shows such an example. The precision of the approximation, 
stochastic or not, is indexed by an integer-valued parameter TV, and the larger 
the N , the better the approximation. This parameter can be, for instance, the 
number of replications in a Monte Carlo method. The following hypothesis 
makes precise the quality of the approximation. 

Hypothesis 4. For all N > \ we can compute a deterministic or stochastic 
approximation of ip such that 

E\iP^ (x) - ^j{x)\ < forallxee, 
V N 

and, almost surely, 

\^p^{x) - ^p^{y)\ < 2osc(V') for all x,y(^Q. 



We suppose that this hypothesis holds true in all of the following. The 
attentive reader will notice that the second of the above assumptions can be 
replaced by the existence of a constant C such that, almost surely, {x) — 
'i'^ {y)\ ^ C for all x, y G 9. In the case of approximation by a sample mean of 
i.i.d. summands, the first part of the hypothesis follows from the Marcinkiewicz- 
Zygmund inequality; see Appendix IbI for more details. 

The sequence (/3n), the coohng schedule, is again chosen as 

/3„ = V 1, (4.10) 

although below we argue for the choice a = 1/4 rather than a = 1/3 as in the 
previous section. We will let the parameter N depend on the iteration number 
n as well, N — Nn, and we will assume that the increase is affine in n, meaning 
that Nn = \ Nq + Ninl for some numbers iVo > and iVi > where \x~\ denotes 
rounding x upwards to the nearest integer. We comment on other choices of 
{Nn)n>o following the proof of Theorem 14 . II below . 

We now formalise the simulated annealing procedure in this modified con- 
text. The procedure is again described as a random sequence, denoted by 
(^n)n>Oj with 00 sampled from the law rjo (as is ^o)- The function g is cho- 
sen as in Corollarv l3.2l and (^„) evolves as follows. 
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(cl) In stage n, given the current state 6n, sample a new proposed position Z 
from KiOn,-)- 

(c2) Set Sn+i = Z with probability 

/(/3„(V'^"(^„)-V^"(Z))+) 

and Sn+i = otherwise. 

This procedure requires some comments. In step (c2), ip is approximated at 
two points, On and Z. In the case of random approximations it is unimportant 
whether these two evaluations are independent or not, as we shall see below, 
but it is important that they arc independent of approximations computed in 
previous steps (smaller n) of the algorithm. The reason for this is that, if 
such independence holds, the sequence (^„) forms a Markov chain, and this 
Markov chain is the object of our study. Moreover, the additional randomness 
in step (c2) associated with the phrases 'sample a new proposed position. . . ' and 
'with probability. . . ', typically obtained by drawing random numbers uniformly 
in (0, 1), must be based on two mutually independent sequences of independent 
random numbers, also independent of the function approximations ip^; this is 
just as in the previous annealing schemes however. 

In cases where the random function approximations ijj^ are such that they 

depend on random variables that arc drawn once and for all and then stay fixed 
over n (sometimes called 'fixed randomness'), so that tp^ is fixed at each point in 
9, we can, as long as N stays fixed too, apply the results of the previous section 
to the function tjj^ provided that it satisfies the regularity assumptions made 
there. Main questions are then rather whether these assumptions indeed are 
satisfied for , and how well the maximum of ip^ and its location approximate 
those of Ip. 

Wc now return to the sequence ((?„). As noted above, this sequence is an 
(inliomogeneous) Markov chain. For any /3 > and > 1, we define the 
function 

<(.^,?y) = .f(/5(^^(.^)-V'^(2/))+). 

For fixed x and y this is indeed a random variable, the randomness coming from 
the evaluations {x) and ^p^{y). We write E^r for the expectation with respect 
to the random variables used to compute tjj^ at a point for some approximation 
index N, and Pjv for the corresponding probability. The kernels K^^ of (^)ra>o, 
defined by 

(x, A) = P(^„+i eA\en = x) 
for any x G Q and A e 13{Q), can then be expressed as 



K^{x, dy) = En {x, y)K{x, dy) +(^~ j^a^ {x, z) K{x, dz)^ 5,{dy) 

(4.11) 

The final assumption we make before stating the main result of this section 
is the following. 

Hypothesis 5. There is a constant Ck such that for all fi > Q and N' > N > 1, 
sup \\f,K^ -f,Kf\\Tv<CKP^^^^^. (4.12) 
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In Appendix ^ we discuss this condition in detail for approximations ob- 
tained as sample means of i.i.d. random variables, and for approximations ob- 
tained using so-called particle filters. It turns out that Hypothesis |5l can often 
be verified through a coupling argument; that is, we couple the approximations 
tp^ and ip^ in a suitable way. We notice that by such an argument it also 
follows that provided Hypothesis ^ holds, one can bound the left-hand side of 
(|4.12|) by a constant times /3/ ^/N ; the actual assumption above is thus stronger. 

Theorem 4.1. Assume that /3„ is as in j^. with a < 1/2 and that Nn 
increases linearly with n. Then under Hypotheses^\^ ^ an(i[3 for all e > 
small enough (if Hypothesis\^i) holds) or < e < e" ( if HvvothesisTWii ) holds), 
there exists a constant C'^ depending on e such that 

P(^(^„)<^max-e) <C^(n-"lognVn3"-i) (4.13) 

for sufficiently large n. 

Equating the two powers of this bounds leads to a = 1/4 as the optimal 
choice, with corresponding rate of convergence n^^/'* log7i. 

The proof of Theorem l4.1l is very similar to the proof of Theorem l3.1l before 
going into it, we will proceed through some technical lemmas. The following re- 
sults can be shown exactly in the same manner as Lemma 13.51 and Corollarv l3.6l 

Lemma 4.2. For any fS > Q and N > 1 it holds that 

y{x,A) e e X B{e) ■. K^{x,A) > £kA(A)/(2/5osc(V')). 

Corollary 4.3. For all (3 > Q, N > \ and any probability measures /i and v on 

\\y.K^ - vK^W^y < (1 - £K/(2/3osc(^)))||/i - v\\^y . 

We point out, in particular, that these results hold true regardless of whether 
the two function approximations required for computing (x, y) are indepen- 
dent or not. 

The results imply that for any /3 > and > 1, the kernel has a 
unique stationary distribution, which we denote by . We will show that 
under certain conditions, /i^ is concentrated around the maximum of ip- 

Lemma 4.4. For all (3 > 0, e > small enough (if Hypothesis \^i) holds) 
or < e < e" (if Hypothesis \^ii) holds) and N > 1 such that N > 0^ and 
N > (Sai/ef, there is a constant C" depending on e but not on N such that 

Proof. We proceed as in Lemma l3 . 71 and thus write 



/ fi'^ {dx) K'^ {x, dy)+ fi'^ {dx) K'^ (x, dy) 

ixeU'''=,yeu^''' J JxeU'.yeu^'" 
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and 



lx£U^-'',y£U' 

Yov xeU^" it holds that 



{dx) K^{x, dy) = / {dx) (l - [ K^ix, dy)] . 



K'^{x,dy)=EN 



fm^'^x)-r{y))+)K{x,dy) 



> 



> 



> 



yeu^ 



eK^MW\y)>r{^))Kdy) 



yew 



eK[l - ^Ni^'^ix) - i^{x) - {y) + V(y) > e/2)] \{dy) 



N I 



y£U'/^ 



yeU'/^ 



N 



where ai is in Hypothesis 01 and we used Markov's inequahty and the assumption 
N > (8ai/£)2. Hence 

/ t^^idx) dy) < t^^iU^n (l - ^A(t/-/2)) 



and 



eU^,yeU''<' 



Hp {dx)Kp (x,dy). 



The integral in this bound equals 

f,^ {dx)fm''{x) - ^-^(2/))+) K{x, dy) 



xeU^,y£U^ 



< E 



N 



1 



idx)—f{p{r [x) - r {y))+) Hdy) 



fi^{dx)f{(3{^{x)-^{y))+) 



£k J Jxeu^ ,yeu^ 



f f{mx)-^{y))+ + PR''ix,y)) 
\ Hmx) - Hy)))+ 



\{dy), 



wheie {x , y) — {ip^ (x) — {y))+ — {■iIj{x) — ^p{y))+. Notice that the expres- 
sion inside the final expectation is bounded by g{(i{'ip{x) — ^{y))+). Thus the 
expectation itself, inserting g{t) = 1 + t/r, may be bounded as 



l + §{yjix)-i;iy))+ + §R^{x,y)J 

l + f(V(a;)-V'(2/))+)^ 



< 2 



N 



^R'^ix^y) 

T 



< 



+ [l + ^{^{x)-ip{y))+ 



< 2 + ^EAr|i?^(a;,y)|. 

T 



N 



T 



> 



l + ^{^{x)-^{y))+) 
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Now notice that in the expression for R'^ {x, y), ip{x) — ip{y) > for those x 
and y appearing in the integral. It is easy to check that for any real a and b such 
that 6 > 0, \a+-b\ < \a-b\. Hence \R'^{x,y)\ < \Tp^ {x)-Tp{x)\ + \ip'^ {y)-'ip{y)\, 
and the above expectation is thus bounded by 2 + A(3ai/{T^/N) < 2 + 4ai/T, 
where we used the assumption N > 0^ . 

As in the proof of Lemma [3. 71 we may conclude that 
^Ji''p{dx)K^{x,dy) 

< / /(/3(V'max - S - Kdy). 

Summing up thus far, we have proved that 

This integral can be bounded just as in the proof of Lemma 13.71 and with 
f{t) = 1/(1 + t/r) these bounds are of order C'J{1 + log/3)//3. □ 

We now formulate an analogue of Lemma 13.91 
Lemma 4.5. For any /?' > /3 > 0, 



jv|| ^ 1 + 2/3oscW/t fP' 



Proof. We have, using Corollarv l4.3l 

II, ,N ,,Nu ^\\,,Nt^N ,,N t^Nu , II,, Nt^N ,,Nt^Nu 

< (l-£K/(2/3osc(V)))||/i^-M^llTV+ sup ll/iX^-MA'^llTv. 

Using H4.11|l , Hypothesis 01 and an argument as in the proof of Lemma 13.91 we 
find that for all fi G ■p(e), 

WfiKp ~ fiKp,\\Tv < sup EN\a^{x,y) - ap,{x,y)\ 

< sup \f{(3u) - fip'u)\ 

0<u<2osc(i/>) 



< sup {\nf3u)\Pu) - 1 

O<n<2osc(-0) VP 



O<n<2osc(-0) 

With g{t) = 1 + t/r, the above supremum is bounded by 1/4. Thus 

11,,^ „^|| 1 UP' ,\ l + 2/3osc(V;)/T f (3' 
^^"^ - '^^'"^^ ^ SK/(2/3oscW)i [j-'j- ^ U " 



□ 
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Proof of Theorem \4-.l\ First we notice that given the assumptions, including 
a < 1/2, Lemma ^31 shows that /j,^" ([/'^''^) is bounded by C"7i~"(l + alogn) 
for sufficiently large n. This term is the first part of the maximum in (|4.13|l . 

Next we denote by fjn the law of On and put A„ — ||^„ — ||tv- Write 



1 + 2l3nOSc{lp)/T ( Pn+l 



< (1-£k/(2/3„osc(7/;)))A„ + 

+11/^1",, -m;:;,1tv, (4.14) 

where we used Corollarv l4 . 81 and Lemma E31 to bound the first two terms. With 
our choice of /?„, the second term on the right-hand side is of order n°'~^. 

To bound the third term we proceed as in the proof of Lemma 14.51 use 
Corollary 14. HI to see that for any f3, N and N' , 

,,N'\\ ^\\,,Nt^N ,,N't^N\\ _^\\,,N't^N ,,N't^N'\\ 
- IItV < - M/3 I|tV + ||A*/3 J<f3 " -^/S I|tV 

< (l-eK/(2/3osc(7A)))||/i^-/if IItv+ sup -fiKf Wty 



to arrive at 



Apply this bound with f3 — Pn+i, N — Nn and N' — Nn+i to sec that the 
final term of H4.14|l is bounded by a constant times /3^_|_;^(-/V„+i — Nn)/Nn under 
Hypothesis [SJ this ratio is of order n^"~^ given that Nn is assumed to be afhne 
in n. 

Summing up thus far, we have proved that 

A„+i < (1 - £k/(2/3„ osc(V'))) A„ + 



for some constant C. Using this inequality we can show as in the proof of The- 
orem l3.l1 that for all e > 0, A„ < C^n"^""^ for some constant depending on 
e. Indeed, in the proof of Theorem 13. II replace the factor x in the denominator 
of the expression that forms the integrand in In+i by x^^" and proceed from 
there. The term C^n^"-^ is the second part of the maximum in (|4.13(l . □ 

One may consider other ways of increasing Nn, for instance iV„ — \No + 
Nin^l for some S > 0. For S > 1 the expression (A'^„+i — iV„)/iV„ is then still 
of order however, so there is no improvement in the proof of Theorem 14. II 
compared to the case of affine increase. For S < 1 the above expression is of 
order , since the 7V„ are integer- valued. The bound corresponding to (|4.13(l 
then becomes of order n~" logn V n^"~^ , with the optimal a being 5/4. 

The above seems to suggest that the rate n~^/'^logn of Section |3| is unob- 
tainable when the function ijj is approximated. This is not the case however, 
but it requires a slightly different approach to analysis than above, and also 
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typically a faster increase of Nn- In the proof of Theorem 14.11 we compared fjn 
to fJ-p" ■ Consider instead comparing to /i/3„, as in the proof of Theorem 13.11 
and write 

+ M/3„ + i^^"+i - M/3„ + i-f^/3„+i- 

On the right-hand side the total variation norm of the first difference is bounded 
by (1 — £ii:/(2/3„ osc(i/')))||^„ — ^/3„||tv (^Corollary I4.3|l . and the norms of the 
remaining differences are bounded by terms of order n"~^ fLemma 13.911 . 
(use part of the proof of Lemma [4.5(1 and n" /Nn^^ respectively. To obtain the 
order n" /Nn"^ of the final term we can couple the kernels Kp and in a way 
similar to that used in the first part of Appendix^ thus obtaining a bound on 
the total variation distance of order /3supj.gQ EAr|-0Ar(a;)— by HypothesisQl 
this expression is 

of order /3/iVi/2. Thus we do not require Hypothesis for this 

analysis. 

We can now put A„ — \\fin — M/3,J|tv and mimic the proof of Theorem 14. II 
To obtain the rate of convergence n~"logn, the norms of all differences on 
the right-hand side, except the first one, must be of order n~^". This in turn 
requires taking a < 1/3 and Nn of the order n^" . In particular this applies when 
a = 1/3, so that this rate is obtainable but at the cost of quickly increasing Nn 
at rate n? . We also notice that when a = 1/4, to obtain the rate of convergence 
n~^^'^ logn it is required to take Nn of order n^/^, which is larger than the linear 
rate used in Theorem 14.11 

However, a more fair way to look at convergence rates is to express them in 
terms of the number of numerical operations performed. We assume that the 
computational cost of computing an approximation ip'^ (x) is of order N; this 
is for instance the case for the Monte Carlo schemes discussed in Appendix IbI 
With Nn being affine in n, the total computational cost up to stage n of the 
simulated annealing scheme is thus of order n^. Denoting the total number of 
numerical operations performed by C, we then find that the convergence rate is 
of order C-^/^\ogC. If we rather use the second bound above, which requires 
Nn of order n^", we see that the computational cost up to stage n is of order 
^&a+i ^jig convergence rate is of order (7-"/(6a-i-i) logC for < a < 1/3. 
The optimal a is a = 1 /3, with rate C'^/^ log C. This is inferior to C'^/^ log C, 
so that the decomposition of the proof of Theorem l4.1l is superior; it does require 
Hypothesis however. 



5 A numerical illustration 

In this section we consider simulated annealing applied to the likelihood function 
of a state-space model as in Appendix IB. 21 Thus assume that we have an 
observed sequence {yt)i<t<T from a state-space model {{St,Yt))i<t<T, whose 
Markov transition kernel Q and conditional output densities r{-\s) both depend 
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on an unknown parameter (vector) 9 which we wish to estimate using maximum 
Ukehhood. 

The log-hkehhood function that we aim to maximise is 

T T „ 

^t{0) ^^\ogpg{yt\yi:t~i) ^^log / re(yt|s) 7rf|j_i(ds), 
t=i t=i 

where pg{yt\yi:t~i) is the conditional density of Yt given Yi-t-i, and 7r^|(„2 
the predictive distribution Fg{St E ■ \ yi:t-i)- As T^fu_i can in general not be 
computed we need to approximate the log-likelihood function, and one way to 
do that is through 

where we take Tr^^^^{ds) as the particle filter approximation of Appendix IB. 21 

The log-likelihood function is essentially a sum of functions of the form 
studied in Appendix IB. 21 except for the logarithmic transformation. Assuming 
however, as in Appendix IB. 21 that rg is uniformly bounded from below by some 
r > 0, we find that each of the integrals above are bounded from below by 
r. Moreover, using the inequality |logx — logy| < |a; — y\/{x A y), valid for all 
x,y > 0, we find that 



T 

reiyt\s)Tr^^u^^{ds) - / rg{yt\s) ir^^^^^ids) 



t=i 



This bound involves sums of functions of the form studied in Appendix IB. 21 
(take h{s) = rg{yt\s)), and we can proceed as there to show that Hypothesis 
holds. A similar argument where we replace irid) by the exact likelihood and 
appeal to Theorem 7.4.4 of iDel Morall l|2004j) shows that Hypothesis 0] holds. 



5.1 Simulation study 

We considered the benchmark model l|Doucet et oi.Ll20f)ll Eqs. 8.3.4-8.3.5) 



St~i 



aSt-i + 'I + 7C0s(1.2t) + a,Vt, (5.15) 



Yt = ^ + a^Wt, (5.16) 

where (St) is the unobserved Markov chain taking values in M, (Yt) is the ob- 
servable process and (Vt) and (Wt) are mutually independent sequences of i.i.d. 
standard Gaussian random variables. We wish to estimate the five model pa- 
rameters = (a, 6, 7, cr^, (Juj) given a sequence {yt)i<t<T of observations, and 
we did so using the approximate maximum likelihood (ML) approach outlined 
above with the bootstrap particle filter, i.e. particle mutations following the sys- 
tem dynamics (|5.15|l . We remark that the state space of the model above is not 
compact, so that the conditional densities rg{y\s) are not bounded from below 
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Figure 1: Normal probability plots of approximate ML estimates of parameters 
(a, 5, 7, (Ji,, (Tu,) in the model H5.15|l - H5.16|l . obtained from 150 replications of 
5,000 iterations of the simulated annealing scheme applied to the particle filter 
approximation of the log-likelihood. 

in s. The model does thus not fulfil the technical conditions made above, but 
the results below are still an illustrative example of how the simulated annealing 
scheme performs in a particular case. 

We simulated a single trajectory {yt)i<t<T of length T = 500 with pa- 
rameters 0° = {a°,b°,j°,a°,(7Z) = (0.9, 18, 10, a/To, 1). In the simulated an- 
nealing scheme we let the inverse temperature be /?„ = lOrt^/"*, correspond- 
ing to r = 1/10 in CoroUarv 13.21 and let number of particles at step n be 
Nn = n V 20, a function which is affine for n > 20 (Theorem 14. 1|) . The 
algorithm was run for 5,000 iterations in each of 150 independent replica- 
tions. The parameter space Q was taken as the five-dimensional hyper-rectangle 
[0.45, 1.8] X [9, 36] x [5, 20] x [0.316, 36] x [0.5, 2]. For K we used a Gaussian ran- 
dom walk proposal (on the log-scale for the standard deviations), where we 
constrained the random walk to Q; any coordinate of the parameter proposed 
outside 8 was pulled back to the boundary. The incremental covariance of the 
kernel at step n was a diagonal matrix whose z-th diagonal element was the 
squared i-th side length of 8 divided by log(7i + 1)^. In each replication the 
initial point Bq was drawn uniformly on 8. 

After 5,000 iterations of the simulated annealing algorithm, the sample 
means and standard errors of the parameter estimates ^5000 (over the 150 repli- 
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cations) were (0.85,19.1,10.1,3.4,1.01) and (0.024,3.0,0.46,0.41,0.11) respec- 
tively. These sample means are in good agreement with the true 9^. Ideally 
we would like to compare to the ML estimates, which are however unavailable. 
Figure H shows that the estimates follow normal distributions with good accu- 
racy, with the exception of ay. This of course is an empirical observation for 
which we have no theoretical support, as we have not discussed convergence in 
law of the differences 6'„ — 6'inax and 0„ — 6'max, suitably scaled, where 0max is 
the point where ip is maximal. 
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A Rate of convergence of classical simulated an- 
nealing 

In this section we prove the bound (|1.3|l and also, by studying a specific example, 
that this bound cannot be improved generally. We assume that Hypotheses EEl 
and Hypothesis 01 ii) hold. Since we now consider classical simulated annealing 
we have fit) = exp(— ii ) and we take /3„ = /3o log(n + e) with l//3o > osc(V') (cf. 
iBartoH and Del Moral boOll Theorem 2.3.5). As in Sectional we let 7]n be the 
law of 9n and denote by jip the invariant distribution of Kp. 

Now write 



We will show that the first term of this decomposition (the difference) tends 
to zero at algebraic rate, while the second term vanishes only logarithmically 
fast. Thus the left-hand side tends to zero at logarithmic rate too. In a specific 
example we will also show that the logarithmic rate for the second term, which 
in general is a bound, is in fact the exact rate; thus the logarithmic rate for 
the left-hand side cannot be improved generally. Here emerges an essential 
difference between classical simulated annealing and the new scheme analysed 
in Sectional In both cases the total variation distance between the law rjn of On 
and the invariant law vanishes at algebraic rate; for classical simulated 
annealing (see below) and n?'^~^ for the new scheme fTheorem l3.1|l . The rate 
at which jip^ concentrates around the maximum of tl) is much different however; 
this rate is algebraic too for the new scheme fLemma l3.7|l . but only logarithmic 
(or algebraic with rate tending to zero) for the classical scheme. 

We now proceed to the details. Put once again A„ — ||?7„ — A^^„||tv- We 
then have the recursion 



see iBartoli and Del Morall l|200ll Remark 3.3.13) and cf. the proof of Theo- 
rem l3.1l With the present choice of (/3„) we find fin+i — < Pa/in + 1) and 
exp(— /3„ osc(V')) = (n + e)~" with a — /3o osc('(/;) < 1. Iterating the above 
recursion yields 



where an empty product (when g = n) is interpreted as unity. Bound the 



A„+i < (1 - e^e-'^" °^^('^))A„ + (/3„+i - /3„) osc(^); 





21 



product as 

n / \ 11 

k=q+l ^ \ ' J / k=q+l ^ ' 

dx 

= + e + 1)1-" -{q + e + 1)1""), 

where Ci = ei<-/(l — a). Thus, using /3oOSc(V') < 1 again as weU, 

n 

9=1 ^ ^ 

^n+1 1 
< g-Ci(n+e+l)i-° / gCi(a;+e+l)i-°_|_^^ _^ ^g-Ci(n+e+l)i-°^ 

By manipulating the integral on the right-hand side, In+i say, we can just as in 
the proof of Theorem 13. II prove that 



1 



Hence we obtain 

" ~ {n + 

C 

< 



{n + 1)1-" 

and thus A„ < C/n^-". 

So far the difference between ?7„ and /i^j^ . We now turn to how concentrated 
fxp^ is around the maximum of ip. To start with we may employ Lemma |3.7I 
with / and /3„ as above, to obtain 

t^pAU'^l < -^^-^ T + (n + e)-*(="--'); 

"^""^ ^ - /3o log(n + e) 

a logarithmic rate in other wo rds. We can also use the property mentioned in 
I Bartoli and Del Morall I2OO1I p. 64), that equals ex.p{P^p{x)) j{dx) up to a 



normalising constant with 7 the invariant distribution of K, to obtain 



i{dy) 



„/3(Vmax-e) , 
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Inserting /3„ for f3, it follows that 

This is the bound ((0|) . 

We now prove that this bound cannot be improved in general. Consider 
the example 9 = [—1/2, 1/2], ip{x) = —\x\, K{x,dy) = dy. Thus K is an 
independence kernel that proposes uniformly on O. It is immediate that the 
invariant measure 7 of if is Lebesgue measure on Q, and that 7 is reversible. 
Now fJ.f3{A) is proportional to J^exp{—P\y\) dy, so that 

g-/3|al dy 
Je 

We can indeed, by an obvious modification of the argument above, adjust l)A.17|l 
into the bound 1/^{U'^^) x where < i5 < 1 is arbitrary. The rate 

of this bound thus can thus be made arbitrarily close to the exact rate of this 
example. 



B Coupling function approximations 

The purpose of this appendix is to illustrate how one may construct function 
approximations that satisfy IIypothesis[Sl and how the relatively 'high level' 
condition of this hypothesis can be guaranteed by more 'low level' assumptions. 

Thus assume that we are given a probability measure /i on 0, /? > 0, and 
two approximation indices N and N' . We wish to bound Wf^K^ — i^K^ ||tv = 
supyi \l^K^ (A) — (A) I, where the supremum is over A E 8(0). We will 

accomplish this by constructing two coupled samples from fiK^ and jJ-K^ 
respectively as follows. 

(i) Sample a point x from /i and then a point z from K{x, •). 

(ii) Compute the function approximations ip'^ (x), ip^ (z), ip^ (x) and (z). 
For the time being we do not specify exactly how this is done. 

(iii) Sample a random number U from the uniform distribution on (0, 1) and 
accept the proposal z if f/ < f{(3{^^{x)-ip^{z))+) ot U < /(/3(i/'^' (x) - 

{z))+) respectively, for the two indices N and N'. 

The samples fiK^ and l^K^ so constructed will be different only if the two 
decisions is step (iii) are different, so the probability of the former event is 
bounded by the probability of the latter one. To compute the probability that 
the decisions of step (iii) differ, we notice this event occurs if U falls in between 
the two function values used there, which, since U is uniform, happens with 
(conditional) probability 

|/(/3(^^(x) - V^(z))+) - /(/3(VA(x) - V'^'(^))+)l- 



23 



Hence the probability of different decisions in step (iii) is bounded by 
snp E|/(/3(V^^(x) - ^^(^))+) - fm'^'ix) - 

where the expectation is w.r.t. the function approximations -0^ and -0^ . The 
difference of the function values can be bounded as 

/3|(0^(a;) - - (V-'^'W - V'^' X /'(C), 

where C is point between the two function arguments. By the assumptions on 
/ its derivative is necessarily bounded, and it is straightforward to check that 
for any real a and b, |a+ — 6+1 < |a — 6|. Therefore the probability of different 
decisions in step (iii) is bounded by 

/3||/'||oo sup E|(0^(a;)-0^(z))-(0^'(x)-VA(z))| 

< 2/3||/'||ooSupE|V'^(x)-0^'(x)|. 

Thus, at this point we see that if the function approximations satisfy 

N' — N 

sup E|0^(x) - 0^ (x)| < C— — (B.18) 

for some constant C, Hypothesis will hold. 

Verifying (jB.lSp is, of course, a problem very much related to the specific 
construction of these approximations. In the following two subsections we will 
deal with two specific settings: i.i.d. sample means and particle filters. 



B.l Simple Monte Carlo sample means 

Here we consider the possibly simplest of all approximation schemes: a sample 
mean of i.i.d. summands. Thus we assume that for a random variable f with 
some known distribution and some known function h, ipi^) — E/i(^; x) where 
the expectation is w.r.t. ^, and that its approximation is 



1 ^ 



1=1 



where the £,i are i.i.d. variables distributed as ^. We note in pas sing that for 
this scheme the Marcinkiewicz-Zygmund inequality l|Shirvaevi 1 1 995[ p. 498) with 
p = 1 implies that Hypothesis^ holds. Moreover, for N' > N, 



0^ (:.)- 0^ (x) =(---] 5: M6;a:)-- ^ ^6 



i=N+l 



and 



E|0^(x)-0^'(x)| < (^l-l^jNEm;x)\ + lJiN'-N)Em;x)\ 



N' - N 

2E|Me;x)|^^. 



It is now immediate that if E|/i(^; x)\ is bounded in x G 8, (jB.lSp holds. 
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B.2 Particle filter estimates 



Consider a state-space model {{St,Yt))t>i, where (5*4) is an unobserved Markov 
chain on some general state space and (Yt) is an observed sequence of random 
variables. The association between (St) and (Yj) is local in the sense that (i) 
given (St), the Y- variables are conditionally independent, and (ii) given (St) 
and for any time index u, the conditional distribution of depends on Su only. 

We will denote the transition kernel of the Markov chain (St) by Q, and the 
conditional density of Yt given St = s hy r{-\s). Both of these quantities are 
assumed to depend on some model parameters 0, which we indicate by writing 
Qg and rg respectively. 

The function we wish to approximate is tl;{6) = Ee[/i(S't) \ yi;t-i], that 
is, the expectation of some function h w.r.t. the so-called predictive distribution 
T^tit-ii') —^ei^t G ■ I yi:t-i), where t > 1 is some time index, the notation yi:t_i 
is short for j/i, ^2, ■ ■ • , J/t-i, and subindex H\t—V indicates that the distribution 
concerns the state at time t conditional on observed data up to time t — 1. 

The predictive distributions can, together with the so-called filter distribu- 
tions 7r^|((-) = Pe{St G • \ yi:t), be computed recursively in time — at least in 
principle. The recursive formulae read 

g,,. rg{y\s)7rf^^^,ids) 

-^titids) = -7 ' (B.19) 

J rg{yW)n!^,_,{ds') 

and 

/ Qo{s,-)7:!\ti-)- (B.20) 

The first of these formulae is just Bayes' rule, and the second one means to 
propagate the filter through the state dynamics Qg. 

In practice the above relations do no admit exact numerical solution except 
in two cases: when the state space of (St) is finite (so-called hidden Markov 
models; the integrals then turn into finite sums) and when the state-space model 
is linear with additive Gaussian noise (the solution then being provided by the 
Kalman filter). There are many ways to approximate these two recursions, and 
here we shall examine an approach referred to as particle filters. This section 
contains a full i ntroduction n e ither to state-space models nor to particle filters, 
and we refer to lPoucet et al\ l|200ltl for a more complete coverage of both. 

The basic idea of a particle filter is to approximate the filter and predic- 
tive distributions with the empirical distributions of a set of particles, whose 
positions are dynamically updated in time. There is not just one particle filter 
algorithm — the term rather refers to a framework for algorithms — and the par- 
ticular algorithm we look at here is usually denoted the bootstrap particle filter. 
We now describe how this algorithm works; the parameter 9 and population size 
N are fixed throughout. 

Assume that at some time index t we have available a collection 
(ffl't^i i)i<j<Af of particles whose empirical distribution approximates 7rj|j_j^. 
The transformation l|B.19(l is approximated as follows. 



25 



(a) Weighting. Compute unnormalised weights \ — ?'e(2/t J ^'^d then 

T J ■ s,N ~e,N 1^ ~e,N 
normaused weights — / 2_^- j . 

(b) Resampling. Create a sample {^^^^)i<i<N by sampHng N times indepen- 
dently from (Cf|t^i_i)i<i<A' with weights {wff)i<i<N- 

The empirical distribution of the sample (^f|'/^)i<i<Af obtained in the resampling 
step approximates tt^^^. 

The transformation ljB.20|) is approximated as follows. 

(c) Mutation. Create a sample {^^^^^ j)i<i<N by independently sampling 
^t+l\t,^ from Qe{it\t,^^■)■ 

The procedure is initialised at time t = by letting (^^|'o'i)i<*<^ i.i.d. 
sample of size N from the initial distribution Pg{Si € •) of the state process. 
This distribution may depend on 9 but is otherwise assumed known. 

The book bv lPel Moral l|2004l) is a thorough treatise of theoretical properties 
of particle filters, and in particular its Theorem 7.4.4 shows that Hypothesis^ 
holds, provided that for each j/^, r0{yt\s) is bounded in 9 and s. We are here 
particularly interested in the particle approximations of the predictive distri- 
butions, and the update of these can be summarised as follows: compute the 
normalised weights W-^ \ and then sample for \ < i < N , independently, first 

an index j with probability w^t^ and then Cf+Tit i ^ Qs(^t\'t^i j)' 

We will now run, simultaneously, two particles filters of sizes N' > N respec- 
tively. All other properties of the filers — data, parameters, dynamics — agree. 
The joint dynamics of the filters will be coupled in a way such that many par- 
ticles of the two filters, at any given time index, coincide. Indeed, for each time 
index t we define a partition Jf U Jj'^ of {1, 2, . . . , TV'} such that ^ = j 

for i G Jt. The details of the coupling are as follows. 

(i) Initialisation. Sample {£,^Q^)l<i<N' independently from Fe{Si G •), let 

^l\o^ = ^i\ol for 1 < i < TV and let Ji = {1, 2, . . . , TV}, Jf ^{N + 1,N + 
2,...,N'}.' 

(ii) Recursion from t to t+1. We have ^f^^i j = ^fft-i i for i & Jt and compute 
the weights iw^f)i<i<N and {w^f )i<t<N'- 

When sampling the new particles, we couple the two filters in a way such 
that independently for each I < i < N , one of the events below take place 
(index j has the same meaning as above); 

- for j € Jt, 

- Ct+i|i,, = ^t+l\t,^ - Qe(Ct|t-ij> ■) with probability w,^^ A w^^ ; 

- ^t+i\t,2 ^ Qeiit\t-i,p') ^it^ probability w^ '^- - w^ '^ A w^;^ ; 

- it+i\t.i ~ Qe{it\t~ij^') "^^^^ probability ^ - w^^j A w^j ; 
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for J e j,-n{i,2,...,7V}, 



e,N 

t+l|i,i 
N' 



) with probability 'w^ ^ , 
t+l\t.^■~ ^t'vs,t|t_ij, ■) with probabihty w^f . 

Finally, for iV < * < N' , ^f;^,;^^ ~ 

We let Jt+i be the set of indices 1 <i < N such that the first of the above 
events happened. 



.) with probability w^'^ . 



From this construction it is immediate that the distributions of the two 
filters are the same as if they had been run separately and independently in the 
usual manner. Let 'n'^^i be the particle filter approximation to the predictive 
distribution at time index t; 



e,N 



for all A € B{Q)^ where I a is the indicator function of A. 

Proposition B.l. Assume that observations yi:T are given and that there is a 
number r > such that r < re{yt\s) < l/r for all 1 < t < T, all s in the state 
space and all 9 £ Q. Then there are constants Ct for 1 < t < T such that for 
any integers N' > N > 0, 



Elk 



t\t-l 



The constants Ct depend on r, but otherwise the bound is uniform in 6. 

for V'^ (61) 



(ds) whenever h is 



Therefore this result implies IjB.l? 
bounded on the state space of (St). 

The requirement of a lower bound r > on rg, uniform in 9 and s, will 
typically be satisfied only if both 9 and the state space of (St) are compact, 
or at least bounded. Boundedness of 6 is as good as implied by Hypothesis Q 
whereas boundedness of the state space is a more serious limitation. Having 
said that we notice that this condition is recurring in the literature on particle 
filters, in particular when treating forgetting properties. 



Proof of Provosition \B. 1[ For any A e B{Q) 



< 



1 ^ ON 1 ^' ON' 



N 



X] -^-4(4-1) ( ^ ^ 



ie.J?A<N 



1 



(B.21) 
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where # denotes cardinality of a set and, in the last step, # Jt was bounded by 
N. We now seek to bound E(#Jj'=). 



Put pt ^ 1- #Jt/N' and define the cr-field Tt = 1 < « < iV) V 



cf{&^i I, ^ 1^ i 1^ N'). Then conditionally on Tt-, is a binomial random 



variable with parameters N and '^i^j^{w^'-'^ A w^'l^ ). Using the definition of 
Jt and abbreviating rg{yt\s) as rtfi(s), we find that 



E 

ieJt 



e,N . 9,N' 



Si<i<Ar ^s,*(^t|t-i,i) 

J2l<i<N^S^ti^t\t-l,i) \ Sl<i<Ar' '^e,*(Ct|t_i,i) , 



f A 



i<N '"».t(«t|t-l,i) 



/ 1 

f A — 



£igjc,i<w''9.t(gf|'t"l.i) \ 



V 



1 



> 



t: , .S.N r 

l^ieJt '■«.*(?t|t-i,,) 

1 



- 1-Pt 
f 



1 



r 



-2. 



-2 Pt 



1-pt 



- 1-pt 

2 =: u{pt). 



We note that as u is convex and decreasing with u{0) = 1, there exists a constant 
C„ > such that u{p) > 1 — C„p. 

The above-mentioned conditional binomial distribution of =ffJt implies, to- 
gether with the above inequality, that E(#Jt+i | !Ft) > Nu{pt), and therefore 



Eip 



•t+i 



Nujpt) 
N' 

u{pt) 



N' -N 

N' 
N' - N 
N' 



u{pt 



< l-u{pt) + 
-■ v{p). 

Applying this inequality recursively, it follows that E(pt) < w°*(po)i where su- 
perindex 'oi' means i-fold function composition. 

We notice that po = {N' — N)/N' and v{p) < CuP+Pa < Cy{p+po) for some 
constant Cy, and by induction we find that there is a constant Cy^t such that 
v°\p) < Cy^p + pa). Thus E(#J,^) = N'E{pt) < 2N'Cy,tPQ = 2Cy,tiN' - N). 
The proof is finished by inserting this bound into the right-hand side of (|B.2ip . 
then taking the supremum over A e V{Q) on the left-hand side and finally the 
expectation. 

□ 
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